- GPUs Explained - Branch Education.
- Cool.
- From smallest to largest.
Thread
- Has its own registers and private variables.
Warp / Wave
- Fixed-size group of threads executed together in lockstep.
- These are the hardware scheduling units: the smallest batch of threads that can be executed together.
- A warp scheduler in the SM/CU issues instructions from one warp/wave at a time to its execution units.
- The SM does not necessarily execute a whole warp in a single cycle; it issues instructions for a warp and can interleave instructions from multiple warps to hide latency.
- The warp is the smallest scheduling/issue granularity, but instruction dispatch and the set of active lanes depend on the pipeline and issue width.
- Reconvergence is implemented by hardware (and compiler) mechanisms; divergent threads are masked and the SM executes each path serially until reconvergence.
- Wavefront (AMD) or warp (NVIDIA).
- Common sizes:
  - NVIDIA warp: 32 threads.
  - AMD wavefront: 64 threads on GCN (RDNA hardware also supports a native 32-wide wave mode).
- These threads share a program counter and execute the same instruction at the same time (SIMT model).
- If threads diverge in control flow, the hardware masks off threads not taking the current branch until they reconverge (see the sketch below).
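A minimal GLSL compute-shader sketch of divergence (the buffer name is hypothetical, not from the video): when lanes of the same warp/wave disagree on the condition, the hardware runs each branch serially with the non-participating lanes masked off, then reconverges.

```glsl
#version 450
layout(local_size_x = 64) in;

layout(std430, binding = 0) buffer Data { float values[]; };  // hypothetical buffer

void main() {
    uint i = gl_GlobalInvocationID.x;
    // Adjacent lanes take different branches here, so the warp executes
    // the "if" path with odd lanes masked off, then the "else" path with
    // even lanes masked off, and reconverges after the branch.
    if ((i & 1u) == 0u) {
        values[i] = values[i] * 2.0;   // even lanes
    } else {
        values[i] = values[i] + 1.0;   // odd lanes
    }
}
```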
Workgroup (API abstraction)
- Defined by you in the compute shader source (Vulkan GLSL/HLSL).
- Group of threads that can share shared memory within a single SM/CU.
- A workgroup is scheduled to a single SM/CU while it is active, but over time it may be re-scheduled onto a different SM (e.g., after preemption or a context switch).
- Practically, code semantics assume the workgroup runs on a single SM until completion.
- Size: arbitrary (within hardware limits), e.g., local_size_x = 256.
- Purpose: a group of threads that:
  - Can share shared memory / LDS.
  - Can synchronize using barrier() calls.
- Hardware: the entire workgroup runs within one SM/CU (so its threads can share that SM's on-chip memory).
- Example:
  - If you set local_size_x = 256, that's 256 threads in the workgroup (see the sketch below).
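A minimal sketch of a 256-thread workgroup cooperating through shared memory and barrier(); the buffer names and the reduction itself are illustrative, not from the video.

```glsl
#version 450
layout(local_size_x = 256) in;

layout(std430, binding = 0) buffer InputBuf  { float inputs[]; };  // hypothetical
layout(std430, binding = 1) buffer OutputBuf { float sums[]; };    // hypothetical

shared float tile[256];  // lives in the SM's shared memory / LDS

void main() {
    uint lid = gl_LocalInvocationID.x;
    tile[lid] = inputs[gl_GlobalInvocationID.x];
    barrier();  // all 256 threads of the workgroup wait here

    // Tree reduction over the workgroup's shared tile.
    for (uint stride = 128u; stride > 0u; stride >>= 1u) {
        if (lid < stride) {
            tile[lid] += tile[lid + stride];
        }
        barrier();
    }
    if (lid == 0u) {
        sums[gl_WorkGroupID.x] = tile[0];  // one result per workgroup
    }
}
```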
Subgroup (API Abstraction)
- Subset of threads in a workgroup that maps to a warp/wave (e.g., 32 threads).
- Enables warp-level operations (shuffles, reductions) without shared memory (see the sketch below).
- Exposed in Vulkan/OpenCL; size is queried at runtime.
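A minimal sketch of a subgroup-level reduction using the Vulkan GLSL subgroup extensions; buffer names are hypothetical.

```glsl
#version 450
#extension GL_KHR_shader_subgroup_basic : enable
#extension GL_KHR_shader_subgroup_arithmetic : enable
layout(local_size_x = 256) in;

layout(std430, binding = 0) buffer InputBuf  { float inputs[]; };    // hypothetical
layout(std430, binding = 1) buffer PartialBuf { float partials[]; }; // hypothetical

void main() {
    float v = inputs[gl_GlobalInvocationID.x];

    // Reduce across the warp/wave without touching shared memory.
    float sum = subgroupAdd(v);

    // One lane per subgroup writes the partial result.
    if (subgroupElect()) {
        uint idx = gl_WorkGroupID.x * gl_NumSubgroups + gl_SubgroupID;
        partials[idx] = sum;
    }
}
```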
SM (Streaming Multiprocessor) / CU (Compute Unit)
- SM = Streaming Multiprocessor (NVIDIA terminology).
- CU = Compute Unit (AMD terminology).
- Hardware block that runs multiple warps/waves.
- This is the fundamental hardware block that executes shader threads.
- Has per-SM caches and shared memory:
  - Registers (not shared between SMs).
  - Shared memory (LDS).
  - L1 cache (per-SM).
  - Implication: if one SM's L1 cache holds certain data, another SM won't see it; coherence happens at L2 (see the sketch below).
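A loose GLSL sketch of that visibility consequence (the buffer name is hypothetical): because L1 is per-SM, cross-workgroup visibility of writes is only well-defined when the buffer is declared coherent and a buffer memory barrier is issued, which pushes visibility out to the level shared between SMs (L2).

```glsl
#version 450
layout(local_size_x = 64) in;

// Hypothetical flag buffer; "coherent" makes writes visible to
// invocations in other workgroups, which may be running on other SMs.
layout(std430, binding = 0) coherent buffer Flags { uint flags[]; };

void main() {
    uint i = gl_GlobalInvocationID.x;
    flags[i] = 1u;
    // Make the store visible beyond this SM's L1 before anything else
    // in this invocation relies on other observers having seen it.
    memoryBarrierBuffer();
}
```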
- Resource Partitioning:
  - The register file is partitioned among resident threads (up to 255 registers/thread on NVIDIA Ampere).
  - Shared memory is configurable (e.g., 64–164 KB per SM on NVIDIA).
- Concurrent Execution:
  - Runs multiple warps/waves simultaneously (e.g., up to 64 resident warps per SM on NVIDIA).
  - Hides latency via zero-cost warp switching (see the worked example below).
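  - Worked example (a rough sketch, assuming a 64 K-entry register file per SM, typical of recent NVIDIA parts): a kernel that needs 64 registers per thread can keep at most 65,536 / 64 = 1,024 threads (32 warps of 32) resident, so heavy register or shared-memory use shrinks the pool of warps available for latency hiding.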
- Each SM/CU has:
  - Its own set of registers (private to threads assigned to it).
  - Its own shared memory / LDS (Local Data Store), accessible to all threads in a workgroup.
  - Access to L1 cache and special-function units (SFUs).
- Vulkan equivalent: a workgroup in a compute shader runs entirely within one SM/CU.
TPC/GPC (Texture Processing Cluster / Graphics Processing Cluster) / SA (Shader Array)
- A group of SMs/CUs that may share intermediate caches or specialized hardware.
- NVIDIA:
  - "Texture Processing Cluster" (TPC) or "Graphics Processing Cluster" (GPC).
  - A TPC groups a couple of SMs with texture units; a GPC groups several TPCs and shares raster/tessellation hardware.
- AMD:
  - "Shader Array" (SA) in RDNA.
  - Uses "Shader Array" or "Workgroup Processor" as a cluster-like grouping.
  - A Workgroup Processor (WGP) groups 2 CUs sharing an instruction cache and LDS (RDNA 2+ also pairs each CU with a ray accelerator).
- Shared at cluster level: sometimes texture units, geometry units, or a shared instruction cache.
GPU Die
- Graphics Engine (e.g., AMD's Shader Engine, NVIDIA's GPC).
- Contains multiple clusters + fixed-function units (geometry, raster).
GPU
- All clusters together, sharing the L2 cache and global memory.